To cite Nate Bargatze, “Hello World!”
As you can see above, my name is Kevin Schell. I am a practicing physical therapist with over twelve years of clinical experience working with adults admitted to a major academic medical center in Boston. I am a board-certified clinical specialist in geriatric physical therapy and have provided physical therapy care to adults across all specialties including critical care environments and the hospital’s emergency department. In fact, my time working in the emergency department ignited a new passion for me to take my clinical experience and (hopefully) positively impact the patient experience at a different level from one-on-one patient care. To this end, I am currently pursuing a Master of Science in Healthcare Data Analytics at the MGH Institute of Health Professions.
To supplement my learning, I have used different publicly available datasets to practice visualization techniques while looking for potential associations to further investigate with statistical analyses. Most recently, I utilized Employee Attrition from Healthcare from Kaggle. In particular, I worked through logistic regression modeling to identify independent risk factors associated with healthcare works leaving their jobs. Please feel free to peruse the rest of this site to see my efforts related to this project.
The Employee Attrition from Healthcare dataset includes fabricated data. Overall, there are nearly 1,700 observations with 34 variables with near flawless data capture (i.e. minimal missing values and no duplicate observations). The primary outcome of interest for this analysis involves whether or not an individual left their job, captured in the variable Attrition. The potential independent variables include demographic information, work experiences, and quality of life measures. The author of the dataset referenced the US healthcare system. Given this, I am making some assumptions regarding units of measure. For example, I am assuming DistanceFromHome is captured in miles as opposed to kilometers, and compensation values are reported in US Dollars. I am also assuming that the self report measures follow typical 4-point Likert scales, however, this is never explicitly stated.
Several of the variables were interpreted as the wrong variable type after being read into R. For example, the main outcome of interest for this dataset Attrition was initially treated as a character/string variable. I converted this to a factor. This was a recurring issue across several variables. To reduce the volume of coding required, I converted the string/character variables to factor variables and then later updated ordinal variables with their respective levels respectively.
I also relabeled the JobRole level Admin to Administrative. This Admin level presented a bit of an issue as it is unclear whether it reflects a truly different level from Administrative versus represents an issue in how the information was collected (i.e. free text submission). I could have simply removed the limited number of observations with JobRole == Admin, but I decided to treat Admin as representing the same role group as Administrative.
if (!require(tidyverse)) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyverse)
setwd("C:\\Users\\kjsch\\OneDrive\\Documents\\Schell Data Analytics Portfolio\\Spring 2024 Data Analytics Project")
att <- read.csv("watson_healthcare_modified.csv")
att$Attrition<-as.factor(att$Attrition)
att <- att %>%
mutate(across(where(is.character), as.factor))
att$JobRole<-recode_factor(att$JobRole, "Admin" = "Administrative")
att$EnvironmentSatisfaction<-factor(att$EnvironmentSatisfaction, ordered = TRUE, levels = c("1", "2", "3", "4"))
att$JobLevel<-factor(att$JobLevel, ordered = TRUE, levels = c("1", "2", "3", "4"))
att$JobInvolvement<-factor(att$JobInvolvement, ordered = TRUE, levels = c("1", "2", "3", "4"))
att$JobSatisfaction<-factor(att$JobSatisfaction, ordered = TRUE, levels = c("1", "2", "3", "4"))
att$RelationshipSatisfaction<-factor(att$RelationshipSatisfaction, ordered = TRUE, level = c("1", "2", "3", "4"))
att$PerformanceRating<-factor(att$PerformanceRating, ordered = TRUE, levels = c("1", "2", "3", "4"))
att$BusinessTravel<-factor(att$BusinessTravel, ordered = TRUE, levels = c("Non-Travel", "Travel_Rarely", "Travel_Frequently"))
att$WorkLifeBalance<-factor(att$WorkLifeBalance, ordered = TRUE, levels = c("1", "2", "3", "4"))
att$Shift<-as.factor(att$Shift)
Out of the 34 variables included in this dataset, only JobLevel contained missing values. There are 81 observations with “NA” as their value for JobLevel. It is possible that “NA” accurately reflects these observations’ JobLevel. In other words, there may be 81 respondents whose roles do not include ascending levels. However, it is also possible that some respondents may have inadvertently failed to respond to this question, or there may have been a transcription error when this dataset was generated from the raw data. Without knowing how this fabricated data was “collected,” it is impossible to know for certain.
Below you will find my initial analysis of missingness using the mice package. The analysis confirmed that only JobLevel had missing values. Summaries of datasets with all observations and one with observations removed for missing JobLevel are included below as well. Had the differences been more egregious, I would have considered imputation of JobLevel for those with missing levels. However, given the lack of clarity regarding the meaning of NA combined with only subtle differences and relatively small percentage of missing values, I decided I would remove the observations with missing values when specifying models.
if (!require(mice)) install.packages("mice")
## Loading required package: mice
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(mice)
md.pattern(att)
## EmployeeID Age Attrition BusinessTravel DailyRate Department
## 1595 1 1 1 1 1 1
## 81 1 1 1 1 1 1
## 0 0 0 0 0 0
## DistanceFromHome Education EducationField EmployeeCount
## 1595 1 1 1 1
## 81 1 1 1 1
## 0 0 0 0
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobRole
## 1595 1 1 1 1 1
## 81 1 1 1 1 1
## 0 0 0 0 0
## JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked
## 1595 1 1 1 1 1
## 81 1 1 1 1 1
## 0 0 0 0 0
## Over18 OverTime PercentSalaryHike PerformanceRating
## 1595 1 1 1 1
## 81 1 1 1 1
## 0 0 0 0
## RelationshipSatisfaction StandardHours Shift TotalWorkingYears
## 1595 1 1 1 1
## 81 1 1 1 1
## 0 0 0 0
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1595 1 1 1 1
## 81 1 1 1 1
## 0 0 0 0
## YearsSinceLastPromotion YearsWithCurrManager JobLevel
## 1595 1 1 1 0
## 81 1 1 0 1
## 0 0 81 81
att.continuous <- att %>%
select(Age, DailyRate, DistanceFromHome, Education, HourlyRate, MonthlyIncome, MonthlyRate, NumCompaniesWorked, PercentSalaryHike, StandardHours, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole,YearsSinceLastPromotion, YearsWithCurrManager)
summary(att.continuous)
## Age DailyRate DistanceFromHome Education
## Min. :18.00 Min. : 102.0 Min. : 1.000 Min. :1.000
## 1st Qu.:30.00 1st Qu.: 465.0 1st Qu.: 2.000 1st Qu.:2.000
## Median :36.00 Median : 796.5 Median : 7.000 Median :3.000
## Mean :36.87 Mean : 800.6 Mean : 9.222 Mean :2.908
## 3rd Qu.:43.00 3rd Qu.:1157.0 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :60.00 Max. :1499.0 Max. :29.000 Max. :5.000
## HourlyRate MonthlyIncome MonthlyRate NumCompaniesWorked
## Min. : 30.00 Min. : 1009 Min. : 2094 Min. :0.000
## 1st Qu.: 48.00 1st Qu.: 2928 1st Qu.: 7993 1st Qu.:1.000
## Median : 65.50 Median : 4899 Median :14270 Median :2.000
## Mean : 65.47 Mean : 6517 Mean :14287 Mean :2.662
## 3rd Qu.: 83.00 3rd Qu.: 8380 3rd Qu.:20462 3rd Qu.:4.000
## Max. :100.00 Max. :19999 Max. :26999 Max. :9.000
## PercentSalaryHike StandardHours TotalWorkingYears YearsAtCompany
## Min. :11.0 Min. :80 Min. : 0.00 Min. : 0.000
## 1st Qu.:12.0 1st Qu.:80 1st Qu.: 6.00 1st Qu.: 3.000
## Median :14.0 Median :80 Median :10.00 Median : 5.000
## Mean :15.2 Mean :80 Mean :11.34 Mean : 7.033
## 3rd Qu.:18.0 3rd Qu.:80 3rd Qu.:15.00 3rd Qu.:10.000
## Max. :25.0 Max. :80 Max. :40.00 Max. :40.000
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.0 1st Qu.: 2.000
## Median : 3.000 Median : 1.0 Median : 3.000
## Mean : 4.265 Mean : 2.2 Mean : 4.135
## 3rd Qu.: 7.000 3rd Qu.: 3.0 3rd Qu.: 7.000
## Max. :18.000 Max. :15.0 Max. :17.000
att.nomord<-att%>%
select(!c(EmployeeID, Age, DailyRate, DistanceFromHome, Education, HourlyRate, MonthlyIncome, MonthlyRate, NumCompaniesWorked, PercentSalaryHike, StandardHours, TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany, YearsInCurrentRole,YearsSinceLastPromotion, YearsWithCurrManager))
summary(att.nomord)
## Attrition BusinessTravel Department EducationField
## No :1477 Non-Travel : 172 Cardiology:531 Human Resources : 29
## Yes: 199 Travel_Rarely :1184 Maternity :796 Life Sciences :697
## Travel_Frequently: 320 Neurology :349 Marketing :189
## Medical :524
## Other : 88
## Technical Degree:149
## EmployeeCount EnvironmentSatisfaction Gender JobInvolvement JobLevel
## Min. :1 1:330 Female:678 1: 96 1 :621
## 1st Qu.:1 2:326 Male :998 2:433 2 :606
## Median :1 3:512 3:983 3 :246
## Mean :1 4:508 4:164 4 :122
## 3rd Qu.:1 NA's: 81
## Max. :1
## JobRole JobSatisfaction MaritalStatus Over18 OverTime
## Administrative:131 1:329 Divorced:377 Y:1676 No :1200
## Nurse :822 2:310 Married :777 Yes: 476
## Other :534 3:507 Single :522
## Therapist :189 4:530
##
##
## PerformanceRating RelationshipSatisfaction Shift WorkLifeBalance
## 1: 0 1:310 0:708 1: 90
## 2: 0 2:346 1:684 2: 385
## 3:1424 3:526 2:185 3:1028
## 4: 252 4:494 3: 99 4: 173
##
##
ggplot(data = att, aes(x = JobLevel, fill = Attrition)) +
geom_bar(color = "dimgrey") +
theme_test() +
scale_fill_manual(values = c("red4", "green4")) +
labs(
title = "Attrition by Job Level",
subtitle = ""
)
addmargins(table(att$JobLevel,att$Attrition))
##
## No Yes Sum
## 1 479 142 621
## 2 572 34 606
## 3 228 18 246
## 4 119 3 122
## Sum 1398 197 1595
att.na <- na.omit(att)
att.na.continuous <- att.na %>%
select(Age, DailyRate, DistanceFromHome, Education, HourlyRate, MonthlyIncome, MonthlyRate, NumCompaniesWorked, PercentSalaryHike, StandardHours, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole,YearsSinceLastPromotion, YearsWithCurrManager)
summary(att.na.continuous)
## Age DailyRate DistanceFromHome Education
## Min. :18.00 Min. : 102.0 Min. : 1.000 Min. :1.000
## 1st Qu.:30.00 1st Qu.: 465.0 1st Qu.: 2.000 1st Qu.:2.000
## Median :35.00 Median : 796.0 Median : 7.000 Median :3.000
## Mean :36.31 Mean : 799.6 Mean : 9.389 Mean :2.905
## 3rd Qu.:42.00 3rd Qu.:1154.0 3rd Qu.:15.000 3rd Qu.:4.000
## Max. :60.00 Max. :1499.0 Max. :29.000 Max. :5.000
## HourlyRate MonthlyIncome MonthlyRate NumCompaniesWorked
## Min. : 30.00 Min. : 1009 Min. : 2094 Min. :0.000
## 1st Qu.: 48.00 1st Qu.: 2860 1st Qu.: 7932 1st Qu.:1.000
## Median : 66.00 Median : 4724 Median :14284 Median :1.000
## Mean : 65.44 Mean : 5874 Mean :14290 Mean :2.622
## 3rd Qu.: 83.00 3rd Qu.: 7365 3rd Qu.:20469 3rd Qu.:4.000
## Max. :100.00 Max. :17924 Max. :26999 Max. :9.000
## PercentSalaryHike StandardHours TotalWorkingYears YearsAtCompany
## Min. :11.00 Min. :80 Min. : 0.00 Min. : 0.000
## 1st Qu.:12.00 1st Qu.:80 1st Qu.: 6.00 1st Qu.: 3.000
## Median :14.00 Median :80 Median : 9.00 Median : 5.000
## Mean :15.25 Mean :80 Mean :10.59 Mean : 6.643
## 3rd Qu.:18.00 3rd Qu.:80 3rd Qu.:14.00 3rd Qu.: 9.000
## Max. :25.00 Max. :80 Max. :40.00 Max. :40.000
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 3.000 Median : 1.000 Median : 3.000
## Mean : 4.153 Mean : 2.092 Mean : 4.003
## 3rd Qu.: 7.000 3rd Qu.: 2.000 3rd Qu.: 7.000
## Max. :17.000 Max. :15.000 Max. :17.000
att.na.nomord<-att.na%>%
select(!c(EmployeeID, Age, DailyRate, DistanceFromHome, Education, HourlyRate, MonthlyIncome, MonthlyRate, NumCompaniesWorked, PercentSalaryHike, StandardHours, TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany, YearsInCurrentRole,YearsSinceLastPromotion, YearsWithCurrManager))
summary(att.na.nomord)
## Attrition BusinessTravel Department EducationField
## No :1398 Non-Travel : 168 Cardiology:515 Human Resources : 24
## Yes: 197 Travel_Rarely :1119 Maternity :777 Life Sciences :666
## Travel_Frequently: 308 Neurology :303 Marketing :180
## Medical :495
## Other : 84
## Technical Degree:146
## EmployeeCount EnvironmentSatisfaction Gender JobInvolvement JobLevel
## Min. :1 1:313 Female:649 1: 90 1:621
## 1st Qu.:1 2:311 Male :946 2:413 2:606
## Median :1 3:492 3:932 3:246
## Mean :1 4:479 4:160 4:122
## 3rd Qu.:1
## Max. :1
## JobRole JobSatisfaction MaritalStatus Over18 OverTime
## Administrative: 83 1:316 Divorced:357 Y:1595 No :1140
## Nurse :822 2:293 Married :735 Yes: 455
## Other :506 3:476 Single :503
## Therapist :184 4:510
##
##
## PerformanceRating RelationshipSatisfaction Shift WorkLifeBalance
## 1: 0 1:298 0:678 1: 89
## 2: 0 2:332 1:639 2:367
## 3:1351 3:499 2:181 3:975
## 4: 244 4:466 3: 97 4:164
##
##
att<-att.na
Continuous Variables
Based on the basic descriptive statistics, the sample at large reflects a relatively young group of workers that skews older that live relatively close to home and earn an average hourly rate well above the average for US healthcare workers (US Bureau of Labor Statistics, 2023). Additionally, most participants have worked at only one company, however, the sample is skewed greatly by those who have worked for several organizations. Interestingly, most participants have been promoted within the last year.
att.continuous <- att %>%
select(Age, DailyRate, DistanceFromHome, Education, HourlyRate, MonthlyIncome, MonthlyRate, NumCompaniesWorked, PercentSalaryHike, StandardHours, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole,YearsSinceLastPromotion, YearsWithCurrManager)
summary(att.continuous)
## Age DailyRate DistanceFromHome Education
## Min. :18.00 Min. : 102.0 Min. : 1.000 Min. :1.000
## 1st Qu.:30.00 1st Qu.: 465.0 1st Qu.: 2.000 1st Qu.:2.000
## Median :35.00 Median : 796.0 Median : 7.000 Median :3.000
## Mean :36.31 Mean : 799.6 Mean : 9.389 Mean :2.905
## 3rd Qu.:42.00 3rd Qu.:1154.0 3rd Qu.:15.000 3rd Qu.:4.000
## Max. :60.00 Max. :1499.0 Max. :29.000 Max. :5.000
## HourlyRate MonthlyIncome MonthlyRate NumCompaniesWorked
## Min. : 30.00 Min. : 1009 Min. : 2094 Min. :0.000
## 1st Qu.: 48.00 1st Qu.: 2860 1st Qu.: 7932 1st Qu.:1.000
## Median : 66.00 Median : 4724 Median :14284 Median :1.000
## Mean : 65.44 Mean : 5874 Mean :14290 Mean :2.622
## 3rd Qu.: 83.00 3rd Qu.: 7365 3rd Qu.:20469 3rd Qu.:4.000
## Max. :100.00 Max. :17924 Max. :26999 Max. :9.000
## PercentSalaryHike StandardHours TotalWorkingYears YearsAtCompany
## Min. :11.00 Min. :80 Min. : 0.00 Min. : 0.000
## 1st Qu.:12.00 1st Qu.:80 1st Qu.: 6.00 1st Qu.: 3.000
## Median :14.00 Median :80 Median : 9.00 Median : 5.000
## Mean :15.25 Mean :80 Mean :10.59 Mean : 6.643
## 3rd Qu.:18.00 3rd Qu.:80 3rd Qu.:14.00 3rd Qu.: 9.000
## Max. :25.00 Max. :80 Max. :40.00 Max. :40.000
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 3.000 Median : 1.000 Median : 3.000
## Mean : 4.153 Mean : 2.092 Mean : 4.003
## 3rd Qu.: 7.000 3rd Qu.: 2.000 3rd Qu.: 7.000
## Max. :17.000 Max. :15.000 Max. :17.000
Nominal and Ordinal Variables
The sample is imbalanced across the outcome of interest, Attrition. Additionally, male gender is grossly overrepresented. Nurse is the most common role identified, and most participants are single and early in their careers. Interestingly, most participants rate their WorkLifeBalance positively.
att.nomord<-att%>%
select(!c(EmployeeID, Age, DailyRate, DistanceFromHome, Education, HourlyRate, MonthlyIncome, MonthlyRate, NumCompaniesWorked, PercentSalaryHike, StandardHours, TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany, YearsInCurrentRole,YearsSinceLastPromotion, YearsWithCurrManager))
summary(att.nomord)
## Attrition BusinessTravel Department EducationField
## No :1398 Non-Travel : 168 Cardiology:515 Human Resources : 24
## Yes: 197 Travel_Rarely :1119 Maternity :777 Life Sciences :666
## Travel_Frequently: 308 Neurology :303 Marketing :180
## Medical :495
## Other : 84
## Technical Degree:146
## EmployeeCount EnvironmentSatisfaction Gender JobInvolvement JobLevel
## Min. :1 1:313 Female:649 1: 90 1:621
## 1st Qu.:1 2:311 Male :946 2:413 2:606
## Median :1 3:492 3:932 3:246
## Mean :1 4:479 4:160 4:122
## 3rd Qu.:1
## Max. :1
## JobRole JobSatisfaction MaritalStatus Over18 OverTime
## Administrative: 83 1:316 Divorced:357 Y:1595 No :1140
## Nurse :822 2:293 Married :735 Yes: 455
## Other :506 3:476 Single :503
## Therapist :184 4:510
##
##
## PerformanceRating RelationshipSatisfaction Shift WorkLifeBalance
## 1: 0 1:298 0:678 1: 89
## 2: 0 2:332 1:639 2:367
## 3:1351 3:499 2:181 3:975
## 4: 244 4:466 3: 97 4:164
##
##
Please see each visualization’s subtitle for takeaway messages
ggplot(data = att, aes(Age)) +
geom_histogram(binwidth = 2, color = "grey65",fill = "hotpink4") +
geom_vline(aes(xintercept = median(Age)), color = "black", linewidth = 1.25) +
theme_test()+
labs(
title = "Age of Respondents",
subtitle = "Skews Older",
x = "Age in Years",
y = "Count of Respondents",
caption = "Median Represented with Line"
)
ggplot(data = att, aes(x = "", y=Age)) +
geom_violin(color = "hotpink4") +
geom_jitter(shape = 21, alpha = 0.25, color = "hotpink4") +
theme_test() +
theme(
axis.ticks.x = element_blank(),
axis.text.x = element_blank()
) +
geom_boxplot(width = 0.15, color = "hotpink4") +
labs(
title = "Age of Respondents",
subtitle = "Older Skew Apparent in Shape",
x = "All Respondents",
y = "Age in Years"
)
ggplot(data = att, aes(x = JobRole, fill = JobRole)) +
geom_bar(color = "dimgrey", show.legend = FALSE) +
theme_test() +
scale_fill_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
labs(
title = "JobRole All Respondents",
subtitle = "Nursing Role is Most Represented"
)
ggplot(data = att, aes(Age, fill = JobRole)) +
geom_histogram(binwidth = 2, color = "dimgrey", show.legend = FALSE) +
facet_wrap(~att$JobRole) +
theme_test()+
scale_fill_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
labs(
title = "Age of Respondents by JobRole",
subtitle = "Admin Relatively Older, Nurses with Narrower Distribution",
x = "Age in Years",
y = "Count of Respondents"
)
ggplot(data = att, aes(x = JobRole, y = Age, color = JobRole)) +
geom_violin(show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
theme_test() +
geom_boxplot(width = 0.15) +
labs(
title = "Age by JobRole",
subtitle = "Admin/Administrative Relatively Older"
) +
scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4"))
ggplot(data = att, aes(x = Gender, fill = Gender)) +
geom_bar(show.legend = FALSE) +
scale_fill_manual(values = c("hotpink4", "dodgerblue4")) +
theme_test() +
labs(
title = "Gender",
subtitle = "Male Gender Overrepresented",
y = "Count of Respondents"
)
ggplot(data = att, aes(x = JobRole, fill = Gender)) +
geom_bar(color = "dimgrey") +
theme_test() +
scale_fill_manual(values = c("hotpink4", "dodgerblue4")) +
labs(
title = "Gender by JobRole",
subtitle = "Male Gender Overrepresented in Nursing and Other Groups"
)
ggplot(data = att, aes(x = MaritalStatus, fill = Gender)) +
geom_bar(color = "dimgrey") +
theme_test() +
labs(
title = "Marital Status All Respondents",
subtitle = "Slightly Less Than Half are Currently Married"
) +
scale_fill_manual(values = c("hotpink4", "dodgerblue4"))
ggplot(data = att, aes(x = MaritalStatus, fill = Gender)) +
geom_bar(color = "dimgrey") +
theme_test() +
facet_wrap(~JobRole) +
scale_fill_manual(values = c("hotpink4", "dodgerblue4")) +
labs(
title = "Marital Status by JobRole",
subtitle = "Nurses and Other Group Greater Percent Currently Single"
)
Please see each visualization’s subtitle for takeaway messages
ggplot(data = att, aes(x = DistanceFromHome)) +
geom_histogram(binwidth = 2, color = "dimgrey",fill = "grey65") +
geom_vline(aes(xintercept = mean(DistanceFromHome)), color = "black", linewidth = 1.25) +
theme_test() +
scale_x_continuous(breaks = round(seq(min(att$DistanceFromHome), max(att$DistanceFromHome), by = 3),1)) +
labs(
title = "Travel Distance, Home to Workplace",
subtitle = "Most Live Close to Workplace, Skews Farther Away ",
caption = "Median Represented by Line"
)
ggplot(data = att, aes(x = JobRole, y = DistanceFromHome, color = JobRole)) +
geom_violin(show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
theme_test() +
scale_y_continuous(breaks = round(seq(min(att$DistanceFromHome), max(att$DistanceFromHome), by = 2),1)) +
scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
geom_boxplot(width = 0.15) +
labs(
title = "Distance from Home by JobRole",
subtitle = "Relatively Similar Across Role Groups"
)
ggplot(data = att, aes(x = BusinessTravel, fill= BusinessTravel)) +
geom_bar(color = "dimgrey") +
theme_test() +
scale_fill_manual(values = c("cyan3", "cadetblue3", "cadetblue")) +
labs(
title = "Count of Respondents by Business Travel",
subtitle = "Most Respondents Rarely Travel"
)
ggplot(data = att, aes(x = BusinessTravel, fill = BusinessTravel)) +
geom_bar(color = "dimgrey", show.legend = FALSE) +
scale_fill_manual(values = c("cyan3", "cadetblue3", "cadetblue")) +
theme_test() +
facet_wrap(~JobRole) +
labs(
title = "Similar Work Travel Experiences across Roles",
subtitle = "Limited no. of Respondents Required to Travel for Work"
)
ggplot(data = att, aes(x = HourlyRate)) +
geom_histogram(binwidth = 5, color = "dimgrey",fill = "darkolivegreen4") +
geom_vline(aes(xintercept = mean(HourlyRate)), color = "black", linewidth = 1.25) +
theme_test() +
scale_x_continuous(breaks = round(seq(min(att$HourlyRate), max(att$HourlyRate), by = 5),1)) +
labs(
title = "Hourly Wage Equivalent, All Respondents",
subtitle = "Bimodal Distribution"
)
ggplot(data = att, aes(x = JobRole, y = HourlyRate, color = JobRole)) +
geom_violin(show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
geom_boxplot(width = 0.15) +
theme_test() +
scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
labs(
title = "Hourly Wage Equivalent, By JobRole",
subtitle = "Similar Wages across Role Groups"
)
ggplot(data = att, aes(x = PercentSalaryHike)) +
geom_histogram(binwidth = 1, color = "dimgrey",fill = "darkolivegreen4") +
geom_vline(aes(xintercept = mean(PercentSalaryHike)), color = "black", linewidth = 1.25) +
theme_test() +
scale_x_continuous(breaks = round(seq(min(att$PercentSalaryHike), max(att$PercentSalaryHike), by = 1),1)) +
labs(
title = "Percent Salary Hike, All Respondents",
subtitle = "Marked Skew Toward Higher Percent Salary Changes"
)
ggplot(data = att, aes(x = JobRole, y = PercentSalaryHike, color = JobRole)) +
geom_violin(show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
geom_boxplot(width = 0.15) +
theme_test() +
scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
labs(
title = "Percent Salary Hike By JobRole",
subtitle = "Median Therapist Percent Salary Change Higher, Otherwise Similar across Roles"
)
ggplot(data = att, aes(x = OverTime, fill = OverTime)) +
geom_bar(color = "dimgrey")+
theme_test()+
labs(
title = "Over Time Status - All Participants",
subtitle = "Less Thank One Third Participate in Over Time"
) +
scale_fill_manual(values = c("red4", "green4"))
ggplot(data = att, aes(x = OverTime, fill = OverTime)) +
geom_bar(color = "dimgrey") +
theme_test() +
facet_wrap(~JobRole) +
labs(
title = "Over Time Participation by JobRole",
subtitle = "Other Role Group Appears to Have Higher Percentage of Over Time Participation") +
scale_fill_manual(values = c("red4", "green4"))
ggplot(data = att, aes(x = TotalWorkingYears)) +
geom_histogram(binwidth = 2, color = "dimgrey",fill = "darkslategray") +
geom_vline(aes(xintercept = median(TotalWorkingYears)), color = "black", linewidth = 1.25) +
theme_test() +
scale_x_continuous(breaks = round(seq(min(att$TotalWorkingYears), max(att$TotalWorkingYears), by = 2),1)) +
labs(
title = "Total Career Length, All Respondents",
subtitle = "Marked Drop around 12 Years & Heavily Skewed toward Long Careers",
caption = "Median Represented with Line"
)
ggplot(data = att, aes(x = JobRole, y = TotalWorkingYears, color = JobRole)) +
geom_violin(show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
geom_boxplot(width = 0.15) +
theme_test() +
scale_y_continuous(breaks = round(seq(min(att$TotalWorkingYears), max(att$TotalWorkingYears), by = 5),1)) +
scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
labs(
title = "Total Career Length By JobRole",
subtitle = "Administrative Respondents with Longer Careers, All Roles Skew Right"
)
ggplot(data = att, aes (x = YearsAtCompany)) +
geom_histogram(binwidth = 2, color = "dimgrey", fill = "blue4") +
geom_vline(xintercept = mean(att$YearsAtCompany, color = "black", linewidth = 1.25 )) +
scale_x_continuous(breaks = round(seq(min(att$YearsAtCompany), max(att$YearsAtCompany), by = 2),1)) +
theme_test() +
labs(
title = "Years at Current Company",
subtitle = "Most Respondents with Less Than 10 Years at Company",
caption = "Median Represented with Line"
)
ggplot(data = att, aes(x = JobRole, y = YearsAtCompany, color = JobRole)) +
geom_violin(show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
geom_boxplot(width = 0.15) +
theme_test() +
scale_y_continuous(breaks = round(seq(min(att$YearsAtCompany), max(att$YearsAtCompany), by = 5),1)) +
scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
labs(
title = "Years at Current Company by JobRole",
subtitle = "Administrative Respondends with More Years at Company"
)
ggplot(data = att, aes (x = YearsInCurrentRole)) +
geom_histogram(binwidth = 1, color = "dimgrey", fill = "blue4") +
geom_vline(xintercept = median(att$YearsInCurrentRole, color = "black", linewidth = 1.25 )) +
scale_x_continuous(breaks = round(seq(min(att$YearsInCurrentRole), max(att$YearsInCurrentRole), by = 2),1)) +
theme_test() +
labs(
title = "Years in Current Role",
subtitle = "Multimodal Distribution",
caption = "Median Represented with Line"
)
ggplot(data = att, aes(x = JobRole, y = YearsInCurrentRole, color = JobRole)) +
geom_violin(show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
geom_boxplot(width = 0.15) +
theme_test() +
scale_y_continuous(breaks = round(seq(min(att$YearsInCurrentRole), max(att$YearsInCurrentRole), by = 5),1)) +
scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
labs(
title = "Years in Current Role by JobRole",
subtitle = "Nurse and Other with Greater Skew toward Longer Time in Role"
)
ggplot(data = att, aes (x = YearsSinceLastPromotion)) +
geom_histogram(binwidth = 1, color = "dimgrey", fill = "azure3") +
geom_vline(xintercept = median(att$YearsSinceLastPromotion, color = "black", linewidth = 1.25 )) +
scale_x_continuous(breaks = round(seq(min(att$YearsSinceLastPromotion), max(att$YearsSinceLastPromotion), by = 2),1)) +
theme_test() +
labs(
title = "Years Since Last Promotion",
subtitle = "Most Respondents with Promotion in Last 2 Years",
caption = "Median Represented by Line"
)
ggplot(data = att, aes(x = JobRole, y = YearsSinceLastPromotion, color = JobRole)) +
geom_violin(show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
geom_boxplot(width = 0.15) +
theme_test() +
scale_y_continuous(breaks = round(seq(min(att$YearsSinceLastPromotion), max(att$YearsSinceLastPromotion), by = 5),1)) +
scale_color_manual(values = c("dodgerblue4", "hotpink4", "lightyellow4", "seagreen4")) +
labs(
title = "Years Since Last Promotion by JobRole",
subtitle = "Administrative Group with Greatest Variability"
)
ggplot(data = att, aes(x = JobLevel, fill = JobLevel)) +
geom_bar(color = "dimgrey") +
theme_test() +
scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4", "black")) +
labs(
title = "Job Level All Respondents",
subtitle = "Greater Number of Respondents with Lower Job Levels"
)
Please see each visualization’s subtitle for takeaway messages
ggplot(data = att, aes(x = JobSatisfaction, fill = JobSatisfaction)) +
geom_bar(color = "dimgrey", show.legend = FALSE) +
scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
theme_test() +
labs(
title = "Job Satisfaction All Respondents",
subtitle = "Generally More Satisfied Than Not"
)
ggplot(data = att, aes(x = JobSatisfaction, fill = JobSatisfaction)) +
geom_bar(color = "dimgrey") +
scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
facet_wrap(~JobRole) +
theme_test() +
labs(
title = "Job Satisfaction by JobRole",
subtitle = "Nurses Relatively More Satisfied"
)
ggplot(data = att, aes(x = EnvironmentSatisfaction, fill = EnvironmentSatisfaction)) +
geom_bar(color = "dimgrey", show.legend = FALSE) +
scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
theme_test() +
labs(
title = "Environment Satisfaction All Respondents",
subtitle = "Generally More Favorable Than Not"
)
ggplot(data = att, aes(x = EnvironmentSatisfaction, fill = EnvironmentSatisfaction)) +
geom_bar(color = "dimgrey") +
scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
facet_wrap(~JobRole) +
theme_test() +
labs(
title = "Environment Satsifaction by JobRole",
subtitle = "Nurses Relatively More Satsified as Compared to Other Roles")
ggplot(data = att, aes(x = WorkLifeBalance, fill = WorkLifeBalance)) +
geom_bar(color = "dimgrey", show.legend = FALSE) +
scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
theme_test() +
labs(
title = "Work Life Balance All Respondents",
subtilte = "Generally More Satisfied Than Not"
)
ggplot(data = att, aes(x = WorkLifeBalance, fill = WorkLifeBalance)) +
geom_bar(color = "dimgrey") +
scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
facet_wrap(~JobRole) +
theme_test() +
labs(
title = "Work Life Balance by JobRole",
subtitle = "Similar Experiences Across Role Groups"
)
ggplot(data = att, aes(x = RelationshipSatisfaction, fill = RelationshipSatisfaction)) +
geom_bar(color = "dimgrey", show.legend = FALSE) +
scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
theme_test()+
labs(
title = "Relationship Satisfaction All Respondents",
subtitle = "Generally More Satisfied Than Not"
)
ggplot(data = att, aes(x = JobSatisfaction, fill = JobSatisfaction)) +
geom_bar(color = "dimgrey") +
scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
facet_wrap(~JobRole) +
theme_test() +
labs(
title = "Relationship Satisfaction by JobRole",
subtitle = "Diferent Experiences Across Role Groups"
)
ggplot(data = att, aes(x = JobInvolvement, fill = JobInvolvement)) +
geom_bar(color = "dimgrey", show.legend = FALSE) +
scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
theme_test() +
labs(
title = "Job Involvement All Respondents",
subtitle = "Generally Perceive More Engaged Than Not"
)
ggplot(data = att, aes(x = JobInvolvement, fill = JobInvolvement)) +
geom_bar(color = "dimgrey") +
scale_fill_manual(values = c("azure", "azure2", "azure3", "azure4")) +
facet_wrap(~JobRole) +
theme_test() +
labs(
title = "Job Involvement by JobRole",
subtitle = "Nurses and Other Group Perceived Greater Engagement"
)
ggplot(data = att, aes(x = PerformanceRating, fill = PerformanceRating)) +
geom_bar(color = "dimgrey", show.legend = FALSE) +
scale_fill_manual(values = c("azure3", "azure4")) +
theme_test() +
labs(
title = "Performance Rating All Respondents",
subtitle = "No Poorly Performing Respondents"
)
ggplot(data = att, aes(x = PerformanceRating, fill = PerformanceRating)) +
geom_bar(color = "dimgrey") +
scale_fill_manual(values = c("azure3", "azure4")) +
facet_wrap(~JobRole) +
theme_test()+
labs(
title = "Performance Rating by JobRole",
subtitle = "Similar Ratings Across Role Groups"
)
ggplot(data = att, aes(x = Attrition, fill = Attrition)) +
geom_bar(color = "dimgrey", show.legend = FALSE) +
theme_test() +
scale_fill_manual(values = c("red4", "green4")) +
labs(
title = "Attrition All Respondents",
subtitle = "Recent Resignation or Separation Relatively Rare"
)
ggplot(data = att, aes(x = Attrition, fill = Attrition)) +
geom_bar(color = "dimgrey") +
scale_fill_manual(values = c("red4", "green4")) +
facet_wrap(~JobRole) +
theme_test() +
labs(
title = "Attrition by JobRole",
subtitle = "Relatively Rare Across Role Groups"
)
ggplot(data = att, aes(x = Gender, fill = Attrition)) +
geom_bar(color = "dimgrey") +
theme_test() +
scale_fill_manual(values = c("red4", "green4")) +
labs(
title = "Attrition by Gender",
subtitle = "Similar Experiences for Male and Female Gender"
)
addmargins(table(att$Gender, att$Attrition))
##
## No Yes Sum
## Female 563 86 649
## Male 835 111 946
## Sum 1398 197 1595
Percent_Female_Attrition <- 86/678
Percent_Male_Attrition <- 113/998
Percent_Female_Attrition
## [1] 0.1268437
Percent_Male_Attrition
## [1] 0.1132265
ggplot(data = att, aes(x = Attrition, y = Age, color = Attrition)) +
geom_violin(show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
geom_boxplot(width = 0.15) +
theme_test() +
scale_color_manual(values = c("red4", "green4")) +
labs(
title = "Age by Attrition Status",
subtitle = "Those who Recently Left Job Appear Younger on Avg"
)
ggplot(data = att, aes(x = Attrition, y = DistanceFromHome, color = Attrition)) +
geom_violin(color = "dimgrey", show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
geom_boxplot(width = 0.15) +
theme_test() +
scale_color_manual(values = c("red4", "green4")) +
labs(
title = "Travel Distance by Attrition Status",
subtitle = "Greater Variability in Travel Distances for Those Who Left Role"
)
ggplot(data = att, aes(x = Attrition, y = HourlyRate, color = Attrition)) +
geom_violin(show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
geom_boxplot(width = 0.15) +
theme_test() +
scale_color_manual(values = c("red4", "green4")) +
labs(
title = "Hourly Rate by Attrition Status",
subtitle = "Similar Pay Experiences"
)
ggplot(data = att, aes(x = Attrition, fill = Attrition)) +
geom_bar(color = "dimgrey", show.legend = FALSE) +
facet_wrap(~ OverTime) +
theme_test() +
scale_fill_manual(values = c("red4", "green4")) +
labs(
title = "Attrition by OverTime Participation - All Participants",
subtitle = "Much Higher Percentage of Over Time Participants Leave Job",
x = "No Over Time on Left, At Least Some Over Time on Right"
)
ggplot(data = att, aes(x = Attrition, y = PercentSalaryHike, color = Attrition)) +
geom_violin(show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
geom_boxplot(width = 0.15) +
theme_test() +
scale_color_manual(values = c("red4", "green4")) +
labs(
title = "Percent Salary Change by Attrition Status",
subtitle = "No Difference"
)
ggplot(data = att, aes(x = Attrition, y = TotalWorkingYears, color = Attrition)) +
geom_violin(show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
geom_boxplot(width = 0.15) +
theme_test() +
scale_color_manual(values = c("red4", "green4")) +
labs(
title = "Career Duration by Attrition Status",
subtitle = "Shorter Career Duration on Avg for Those Who Left Role"
)
ggplot(data = att, aes(x = Attrition, y = YearsAtCompany, color = Attrition)) +
geom_violin(show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
geom_boxplot(width = 0.15) +
theme_test() +
scale_color_manual(values = c("red4", "green4")) +
labs(
title = "Years with Current Company by Attrition Status",
subtitle = "Less time with Current Company for Those who Left Role"
)
ggplot(data = att, aes(x = Attrition, y = YearsInCurrentRole, color = Attrition)) +
geom_violin(show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
geom_boxplot(width = 0.15) +
theme_test() +
scale_color_manual(values = c("red4", "green4")) +
labs(
title = "Years in Current Role by Attrition Status",
subtitle = "Fewer Years in Current Role for Those Who Left Job"
)
ggplot(data = att, aes(x = Attrition, y = YearsSinceLastPromotion, color = Attrition)) +
geom_violin(show.legend = FALSE) +
geom_jitter(show.legend = FALSE, shape = 20, alpha = 0.33) +
geom_boxplot(width = 0.15) +
theme_test() +
scale_color_manual(values = c("red4", "green4")) +
labs(
title = "Years Since Promotion by Attrition Status",
subtitle = "Greater Variability for Those Who Have Not Left Job"
)
ggplot(data = att, aes(x = EnvironmentSatisfaction, fill = Attrition)) +
geom_bar(color = "dimgrey") +
theme_test() +
scale_fill_manual(values = c("red4", "green4")) +
labs(
title = "Attrition by Environment Satisfaction Scores",
subtitle = "Greater Percentage of Less Satsified Leave Role"
)
ggplot(data = att, aes(x = MaritalStatus, fill = Attrition)) +
geom_bar(color = "dimgrey") +
theme_test() +
scale_fill_manual(values = c("red4", "green4")) +
labs(
title = "Attrition by Marital Status",
subtitle = "Higher Percentage of Married Respondents Left Role"
)
ggplot(data = att, aes(x = RelationshipSatisfaction, fill = Attrition)) +
geom_bar(color = "dimgrey") +
theme_test() +
scale_fill_manual(values = c("red4", "green4")) +
labs(
title = "Attrition by Relationship Satisfaction Scores",
subtitle = "Similar Attrition Rates across Grous"
)
ggplot(data = att, aes(x = WorkLifeBalance, fill = Attrition)) +
geom_bar(color = "dimgrey") +
scale_fill_manual(values = c("red4", "green4")) +
theme_test() +
labs(
title = "Attrition by Work Life Balance Scores",
subtitle = "Similar Attrition across Groups"
)
ggplot(data = att, aes(x = JobInvolvement, fill = Attrition)) +
geom_bar(color = "dimgrey") +
theme_test() +
scale_fill_manual(values = c("red4", "green4")) +
labs(
title = "Attrition by Job Involvement Scores",
subtitle = "Higher Percentage of Lower Scores Leave Role"
)
ggplot(data = att, aes(x = JobSatisfaction, fill = Attrition)) +
geom_bar(color = "dimgrey") +
theme_test() +
scale_fill_manual(values = c("red4", "green4")) +
labs(
title = "Attrition by Job Satisfaction",
subtitle = "Similar Gross Attrition but Greater Percentage for Lower Scores"
)
ggplot(data = att, aes(x = PerformanceRating, fill = Attrition)) +
geom_bar(color = "dimgrey") +
theme_test() +
scale_fill_manual(values = c("red4", "green4"))+
labs(
title = "Attrition by Performance Rating",
subtitle = "Similar Attrition Rates across High Performing Respondents"
)
addmargins(table(att$Attrition, att$PerformanceRating))
##
## 1 2 3 4 Sum
## No 0 0 1186 212 1398
## Yes 0 0 165 32 197
## Sum 0 0 1351 244 1595
percent_attrition_score3 <- 167/1424
percent_attrition_score4 <- 32/252
percent_attrition_score3
## [1] 0.1172753
percent_attrition_score4
## [1] 0.1269841
ggplot(data = att, aes(x = JobLevel, fill = Attrition)) +
geom_bar(color = "dimgrey") +
theme_test() +
scale_fill_manual(values = c("red4", "green4")) +
labs(
title = "Attrition by Job Level",
subtitle = "Hihger Percentage of Respondents with Lowest Job Level Leave Role"
)
I divided the dataset with observations with NA for JobLevel removed into a training dataset and testing dataset with 80% of the observations going to former and 20% to the latter.
train_set_size <- floor(0.80 * nrow(att))
set.seed(316)
train_div<-sample(seq_len(nrow(att)), size = train_set_size)
train.att<-att[train_div, ]
test.att <- att[-train_div, ]
dim(train.att)
## [1] 1276 35
dim(test.att)
## [1] 319 35
To identify relevant independent variables to include in the logistic regression models, I focused on the Attrition visualizations and anecdotal evidence shared by healthcare workers regarding their rationales for leaving their jobs.
Specific independent variables included initially include JobRole, Age, DistanceFromHome, TotalWorkingYears, YearsAtCompany YearsInCurrentRole, YearsSinceLastPromotion, EnvironmentSatisfaction, OverTime, MartialStatus, JobInvolvement, JobSatisfaction, and JobLevel.
When considering the temporal variables, there is likely some degree of correlation between Age and TotalWorkingYears, YearsAtCompany, and YearsInCurrentRole. Given this, I will create an additional model with the non-Age temporal variables removed to see how they perform against each other.
It is possible there are associations between Age and MaritalStatus and JobLevel as well. I will pull these variables in a third model that will only include JobRole, Age, DistanceFromHome, YearsSinceLastPromotion, EnvironmentalSatisfaction, OverTime, JobInvolvement, and JobSatisfaction.
Lastly, I have been considering JobRole as a grouping variable and using fixed effects in the first three models. For the fourth model, I will explore a mixed effects model using “random” intercepts for the different role groups. Further grouping or levels is limited by the nature of this dataset.
if (!require(lme4)) install.packages("lme4")
## Loading required package: lme4
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
library(lme4)
if (!require(jtools)) install.packages("jtools")
## Loading required package: jtools
library(jtools)
if (!require(ResourceSelection)) install.packages("ResourceSelection")
## Loading required package: ResourceSelection
## ResourceSelection 0.3-6 2023-06-27
library(ResourceSelection)
if (!require(lmtest)) install.packages("lmtest")
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(lmtest)
reg1 <- glm(Attrition ~ JobRole+Age+DistanceFromHome+TotalWorkingYears+ YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement + JobSatisfaction + JobLevel, data = train.att, family = "binomial")
summ(reg1, exp = TRUE, confint = TRUE)
## MODEL INFO:
## Observations: 1276
## Dependent Variable: Attrition
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(24) = 508.11, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.63
## Pseudo-R² (McFadden) = 0.54
## AIC = 485.75, BIC = 614.54
##
## Standard errors: MLE
## -----------------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p
## ------------------------------- ------------- ------- ------- -------- ------
## (Intercept) 0.00 0.00 Inf -0.03 0.98
## JobRoleNurse 11946697.92 0.00 Inf 0.03 0.98
## JobRoleOther 8477435.94 0.00 Inf 0.03 0.98
## JobRoleTherapist 4926000.76 0.00 Inf 0.03 0.98
## Age 0.91 0.87 0.95 -4.52 0.00
## DistanceFromHome 1.09 1.06 1.12 5.44 0.00
## TotalWorkingYears 0.97 0.90 1.04 -0.84 0.40
## YearsAtCompany 0.86 0.75 0.99 -2.08 0.04
## YearsInCurrentRole 0.83 0.70 0.99 -2.07 0.04
## YearsSinceLastPromotion 1.27 1.10 1.46 3.25 0.00
## EnvironmentSatisfaction.L 0.20 0.12 0.33 -5.98 0.00
## EnvironmentSatisfaction.Q 2.06 1.22 3.47 2.72 0.01
## EnvironmentSatisfaction.C 1.00 0.60 1.67 0.01 0.99
## OverTimeYes 46.27 23.78 90.05 11.29 0.00
## MaritalStatusMarried 1.71 0.83 3.54 1.44 0.15
## MaritalStatusSingle 10.05 4.74 21.33 6.01 0.00
## JobInvolvement.L 0.12 0.05 0.29 -4.61 0.00
## JobInvolvement.Q 1.17 0.57 2.38 0.43 0.67
## JobInvolvement.C 0.93 0.58 1.50 -0.28 0.78
## JobSatisfaction.L 0.36 0.22 0.61 -3.80 0.00
## JobSatisfaction.Q 0.92 0.54 1.56 -0.30 0.76
## JobSatisfaction.C 0.75 0.44 1.27 -1.08 0.28
## JobLevel.L 2.25 0.45 11.12 0.99 0.32
## JobLevel.Q 4.49 1.46 13.76 2.63 0.01
## JobLevel.C 0.39 0.18 0.84 -2.39 0.02
## -----------------------------------------------------------------------------
hoslem.test(reg1$y,predict(reg1, type = "response"))
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: reg1$y, predict(reg1, type = "response")
## X-squared = 6.2056, df = 8, p-value = 0.6242
reg2 <- glm(Attrition ~ JobRole+Age+DistanceFromHome + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement + JobSatisfaction + JobLevel, data = train.att, family = "binomial")
summ(reg2, exp = TRUE, confint = TRUE)
## MODEL INFO:
## Observations: 1276
## Dependent Variable: Attrition
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(21) = 469.94, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.59
## Pseudo-R² (McFadden) = 0.50
## AIC = 517.92, BIC = 631.25
##
## Standard errors: MLE
## ----------------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p
## ------------------------------- ------------ ------- ------- -------- ------
## (Intercept) 0.00 0.00 Inf -0.03 0.98
## JobRoleNurse 7546941.07 0.00 Inf 0.02 0.98
## JobRoleOther 4952349.21 0.00 Inf 0.02 0.98
## JobRoleTherapist 3309751.64 0.00 Inf 0.02 0.98
## Age 0.89 0.86 0.92 -6.09 0.00
## DistanceFromHome 1.08 1.05 1.12 5.33 0.00
## YearsSinceLastPromotion 0.99 0.89 1.10 -0.17 0.86
## EnvironmentSatisfaction.L 0.23 0.14 0.37 -5.79 0.00
## EnvironmentSatisfaction.Q 1.85 1.13 3.02 2.45 0.01
## EnvironmentSatisfaction.C 1.00 0.61 1.64 0.01 0.99
## OverTimeYes 37.63 20.23 69.98 11.46 0.00
## MaritalStatusMarried 1.75 0.87 3.54 1.56 0.12
## MaritalStatusSingle 8.69 4.22 17.92 5.86 0.00
## JobInvolvement.L 0.11 0.04 0.25 -5.00 0.00
## JobInvolvement.Q 1.11 0.55 2.22 0.29 0.77
## JobInvolvement.C 0.91 0.58 1.44 -0.40 0.69
## JobSatisfaction.L 0.38 0.23 0.63 -3.80 0.00
## JobSatisfaction.Q 0.77 0.47 1.27 -1.02 0.31
## JobSatisfaction.C 0.85 0.52 1.40 -0.62 0.53
## JobLevel.L 0.83 0.23 2.99 -0.29 0.77
## JobLevel.Q 3.77 1.41 10.09 2.64 0.01
## JobLevel.C 0.34 0.17 0.67 -3.07 0.00
## ----------------------------------------------------------------------------
lrtest(reg1, reg2)
## Likelihood ratio test
##
## Model 1: Attrition ~ JobRole + Age + DistanceFromHome + TotalWorkingYears +
## YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion +
## EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement +
## JobSatisfaction + JobLevel
## Model 2: Attrition ~ JobRole + Age + DistanceFromHome + YearsSinceLastPromotion +
## EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement +
## JobSatisfaction + JobLevel
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 25 -217.88
## 2 22 -236.96 -3 38.165 2.607e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
hoslem.test(reg2$y, predict(reg2, type = "response"))
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: reg2$y, predict(reg2, type = "response")
## X-squared = 3.0868, df = 8, p-value = 0.9288
reg3 <- glm(Attrition ~ JobRole+Age+DistanceFromHome + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime + JobInvolvement + JobSatisfaction, data = train.att, family = "binomial")
summ(reg3, exp = TRUE, confint = TRUE)
## MODEL INFO:
## Observations: 1276
## Dependent Variable: Attrition
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(16) = 366.50, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.48
## Pseudo-R² (McFadden) = 0.39
## AIC = 611.36, BIC = 698.94
##
## Standard errors: MLE
## ----------------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p
## ------------------------------- ------------ ------- ------- -------- ------
## (Intercept) 0.00 0.00 Inf -0.02 0.98
## JobRoleNurse 4758164.12 0.00 Inf 0.02 0.98
## JobRoleOther 6108699.25 0.00 Inf 0.02 0.98
## JobRoleTherapist 1128628.72 0.00 Inf 0.02 0.98
## Age 0.87 0.84 0.90 -8.03 0.00
## DistanceFromHome 1.06 1.03 1.08 4.23 0.00
## YearsSinceLastPromotion 0.98 0.89 1.06 -0.55 0.58
## EnvironmentSatisfaction.L 0.32 0.21 0.49 -5.25 0.00
## EnvironmentSatisfaction.Q 2.08 1.33 3.25 3.20 0.00
## EnvironmentSatisfaction.C 0.88 0.56 1.38 -0.57 0.57
## OverTimeYes 18.57 11.30 30.53 11.52 0.00
## JobInvolvement.L 0.14 0.06 0.30 -4.95 0.00
## JobInvolvement.Q 0.99 0.53 1.85 -0.02 0.98
## JobInvolvement.C 0.85 0.57 1.28 -0.76 0.44
## JobSatisfaction.L 0.39 0.25 0.61 -4.20 0.00
## JobSatisfaction.Q 0.78 0.50 1.22 -1.08 0.28
## JobSatisfaction.C 0.85 0.54 1.33 -0.71 0.48
## ----------------------------------------------------------------------------
hoslem.test(reg3$y, predict(reg3, type = "response"))
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: reg3$y, predict(reg3, type = "response")
## X-squared = 5.6272, df = 8, p-value = 0.6889
reg4 <- glmer(Attrition ~ Age + DistanceFromHome + TotalWorkingYears + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement + JobSatisfaction + JobLevel + (1|JobRole), data = train.att, family = "binomial")
summ(reg4, exp = TRUE, confint = TRUE)
## MODEL INFO:
## Observations: 1276
## Dependent Variable: Attrition
## Type: Mixed effects generalized linear regression
## Error Distribution: binomial
## Link function: logit
##
## MODEL FIT:
## AIC = 493.29, BIC = 611.77
## Pseudo-R² (fixed effects) = 0.75
## Pseudo-R² (total) = 0.75
##
## FIXED EFFECTS:
## ---------------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p
## ------------------------------- ----------- ------- ------- -------- ------
## (Intercept) 0.27 0.05 1.35 -1.60 0.11
## Age 0.90 0.86 0.94 -4.70 0.00
## DistanceFromHome 1.09 1.05 1.12 5.39 0.00
## TotalWorkingYears 0.97 0.90 1.05 -0.76 0.45
## YearsAtCompany 0.88 0.78 1.00 -2.04 0.04
## YearsInCurrentRole 0.82 0.70 0.97 -2.38 0.02
## YearsSinceLastPromotion 1.27 1.10 1.47 3.33 0.00
## EnvironmentSatisfaction.L 0.20 0.12 0.34 -5.95 0.00
## EnvironmentSatisfaction.Q 2.04 1.21 3.42 2.69 0.01
## EnvironmentSatisfaction.C 0.99 0.60 1.65 -0.03 0.97
## OverTimeYes 44.56 23.03 86.19 11.28 0.00
## MaritalStatusMarried 1.54 0.74 3.17 1.16 0.25
## MaritalStatusSingle 9.26 4.40 19.48 5.87 0.00
## JobInvolvement.L 0.12 0.05 0.28 -4.73 0.00
## JobInvolvement.Q 1.17 0.58 2.37 0.45 0.65
## JobInvolvement.C 0.98 0.61 1.56 -0.10 0.92
## JobSatisfaction.L 0.36 0.22 0.60 -3.90 0.00
## JobSatisfaction.Q 0.97 0.58 1.63 -0.11 0.91
## JobSatisfaction.C 0.70 0.42 1.17 -1.36 0.18
## JobLevel.L 0.83 0.21 3.27 -0.27 0.79
## JobLevel.Q 2.61 0.98 6.95 1.92 0.05
## JobLevel.C 0.35 0.17 0.72 -2.85 0.00
## ---------------------------------------------------------------------------
##
## RANDOM EFFECTS:
## -----------------------------------
## Group Parameter Std. Dev.
## --------- ------------- -----------
## JobRole (Intercept) 0.13
## -----------------------------------
##
## Grouping variables:
## ---------------------------
## Group # groups ICC
## --------- ---------- ------
## JobRole 4 0.00
## ---------------------------
lrtest(reg1, reg2)
## Likelihood ratio test
##
## Model 1: Attrition ~ JobRole + Age + DistanceFromHome + TotalWorkingYears +
## YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion +
## EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement +
## JobSatisfaction + JobLevel
## Model 2: Attrition ~ JobRole + Age + DistanceFromHome + YearsSinceLastPromotion +
## EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement +
## JobSatisfaction + JobLevel
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 25 -217.88
## 2 22 -236.96 -3 38.165 2.607e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(reg1, reg3)
## Likelihood ratio test
##
## Model 1: Attrition ~ JobRole + Age + DistanceFromHome + TotalWorkingYears +
## YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion +
## EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement +
## JobSatisfaction + JobLevel
## Model 2: Attrition ~ JobRole + Age + DistanceFromHome + YearsSinceLastPromotion +
## EnvironmentSatisfaction + OverTime + JobInvolvement + JobSatisfaction
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 25 -217.88
## 2 17 -288.68 -8 141.61 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(reg2, reg3)
## Likelihood ratio test
##
## Model 1: Attrition ~ JobRole + Age + DistanceFromHome + YearsSinceLastPromotion +
## EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement +
## JobSatisfaction + JobLevel
## Model 2: Attrition ~ JobRole + Age + DistanceFromHome + YearsSinceLastPromotion +
## EnvironmentSatisfaction + OverTime + JobInvolvement + JobSatisfaction
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 22 -236.96
## 2 17 -288.68 -5 103.45 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I selected the reg1 model over the other three as there was evidence that one of the models outperformed the other fixed effects models. While Reg4 had the highest pseudo-R^2, reg1 had a lower AIC. Also, reg4’s ICC was so poor that I am unsure of the value of including the random intercept for JobRole.
if (!require(car)) install.packages("car")
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(car)
residualPlots(reg1)
## Test stat Pr(>|Test stat|)
## JobRole
## Age 5.4193 0.01992 *
## DistanceFromHome 0.2694 0.60376
## TotalWorkingYears 1.2127 0.27079
## YearsAtCompany 2.4149 0.12018
## YearsInCurrentRole 0.8365 0.36041
## YearsSinceLastPromotion 2.2510 0.13353
## EnvironmentSatisfaction
## OverTime
## MaritalStatus
## JobInvolvement
## JobSatisfaction
## JobLevel
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the above plots, there are little to no relationships between the residuals and the continuous independent variables in reg1. Interestingly, the p-value for age is quite small. If I consider this one piece of evidence alone, I need to consider rejecting my null hypothesis that age fits the model well. I think the best course of action may be to compare two additional models, one with and one without age, alongside any additional changes that may arise with the continued assumption testing.
marginalModelPlots(reg1)
## Warning in mmps(...): Interactions and/or factors skipped
The estimated relationship between independent variables and Attrition and the model predictions for the relationship is strong as evidenced by the consistent overlap of the “Data” and “Model” lines. However, these relationships deviate from being linear.
linearity.check <- train.att %>%
select(Age, DistanceFromHome, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion)
predictors <- colnames(linearity.check)
linearity.check$probabilities <- reg1$fitted.values
linearity.check <- linearity.check %>%
mutate(logit = log(probabilities/(1 - probabilities))) %>%
select(-probabilities) %>%
gather(key = "predictors", value = "predictor.values", -logit)
ggplot(linearity.check, aes(y = logit, x = predictor.values)) +
geom_point() +
stat_smooth(method = "loess") +
theme_test() +
facet_wrap(~predictors, scales = "free_x")
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at -0.075
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 2.075
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 2.4351e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 4
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## -0.075
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.075
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 2.4351e-15
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 4
I decided to assess the actual logits against the continuous independent variables. These grossly appear linear. What is more notable to me is the clustering of lower value logits. The way I visualized these plots disallows me from identifying if these reflect a specific level of a categorical independent variable.
For now, I will not plan on adding any polynomial terms or any other transformations.
Conceptually, there could be some degree of interaction between or among these independent variables that I have not addressed. I am most concerned about an interaction between JobLevel and the temporal variables.
reg1.cont.ind <- att[c("Age", "DistanceFromHome", "TotalWorkingYears", "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion")]
cor(reg1.cont.ind)
## Age DistanceFromHome TotalWorkingYears
## Age 1.00000000 0.01533062 0.66155501
## DistanceFromHome 0.01533062 1.00000000 0.02189604
## TotalWorkingYears 0.66155501 0.02189604 1.00000000
## YearsAtCompany 0.29307296 0.02097758 0.61806113
## YearsInCurrentRole 0.21150996 0.02618534 0.47253042
## YearsSinceLastPromotion 0.19748605 0.01630781 0.39211092
## YearsAtCompany YearsInCurrentRole
## Age 0.29307296 0.21150996
## DistanceFromHome 0.02097758 0.02618534
## TotalWorkingYears 0.61806113 0.47253042
## YearsAtCompany 1.00000000 0.77189266
## YearsInCurrentRole 0.77189266 1.00000000
## YearsSinceLastPromotion 0.61159314 0.54189737
## YearsSinceLastPromotion
## Age 0.19748605
## DistanceFromHome 0.01630781
## TotalWorkingYears 0.39211092
## YearsAtCompany 0.61159314
## YearsInCurrentRole 0.54189737
## YearsSinceLastPromotion 1.00000000
vif(reg1)
## GVIF Df GVIF^(1/(2*Df))
## JobRole 1.486018 3 1.068245
## Age 1.991913 1 1.411351
## DistanceFromHome 1.221443 1 1.105189
## TotalWorkingYears 3.268777 1 1.807976
## YearsAtCompany 4.621883 1 2.149857
## YearsInCurrentRole 3.343026 1 1.828394
## YearsSinceLastPromotion 2.280921 1 1.510272
## EnvironmentSatisfaction 1.506820 3 1.070722
## OverTime 1.860166 1 1.363879
## MaritalStatus 1.350665 2 1.078045
## JobInvolvement 1.224774 3 1.034370
## JobSatisfaction 1.263163 3 1.039704
## JobLevel 3.942323 3 1.256875
As expected in the initial model specifications, there are moderate to strong relationships among the temporal independent variables. Notably, Age and TotalWorkingYears, TotalWorkingYears and YearsAtCompany, and YearsInCurrentRole and YearsAtCompany.
When considering the Variance Inflation Factor, YearsAtCompany is relatively large. I think it may be most reasonable to generate a new model by retaining all of the independent variables save for YearsAtCompany. I will also consider whether or note to include Age based on comparing two models, one with and one without Age.
No participants rated their JobPerformance below a 3. Fortunately, I did not include this variable in any of my models. Therefore, I have no concerns regarding this modelling assumption.
No independent variables perfectly or near perfectly predict the probabilities for Attrition.
outlierTest(reg1, labels = train.att$EmployeeID)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 1536328 3.30954 0.00093449 NA
influencePlot(reg1)
## StudRes Hat CookD
## 1601 -0.0009942438 0.368433609 1.413778e-08
## 317 -0.9500951052 0.239881472 7.645568e-03
## 1221 3.1026592081 0.024904990 6.059328e-02
## 964 3.3095401360 0.005506775 3.446135e-02
## 973 2.2994156966 0.134289854 5.092781e-02
Observations 317, 964, 973, 1221, and 1601 demonstrated significant influence. I will investigate these observations by first by pulling them into their own table. I will then highlight them on scatterplots of the continuous variables.
Outliers.reg1 <- train.att %>%
filter(EmployeeID %in% c(1230043, 1536328, 1081600, 1675357, 1553672)) %>%
select(EmployeeID, JobRole, Age, DistanceFromHome, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, EnvironmentSatisfaction, OverTime, MaritalStatus, JobInvolvement, JobSatisfaction, JobLevel)
print(Outliers.reg1)
## EmployeeID JobRole Age DistanceFromHome TotalWorkingYears
## 1 1553672 Administrative 47 29 10
## 2 1230043 Nurse 49 1 25
## 3 1675357 Nurse 47 9 25
## 4 1536328 Therapist 58 7 31
## 5 1081600 Nurse 55 13 24
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## 1 10 7 9
## 2 7 1 0
## 3 23 5 14
## 4 10 9 5
## 5 19 7 3
## EnvironmentSatisfaction OverTime MaritalStatus JobInvolvement JobSatisfaction
## 1 1 Yes Married 3 2
## 2 3 Yes Single 2 3
## 3 3 No Married 1 3
## 4 3 Yes Married 2 1
## 5 1 Yes Single 4 3
## JobLevel
## 1 3
## 2 4
## 3 4
## 4 3
## 5 4
ggplot(data = train.att, aes(x = Age)) +
geom_histogram(binwidth = 2, color = "dimgrey", fill = "grey65") +
annotate(geom = "text", x = 47, y = 10, label = "155367", color = "dodgerblue4", size = 2.5) +
annotate(geom = "text", x = 49, y = 5, label = "1230043", color = "hotpink4", size = 2.5) +
annotate(geom = "text", x = 47, y = 20, label = "1675357", color = "hotpink4", size = 2.5) +
annotate(geom = "text", x = 58, y = 23, label = "1536328", color = "seagreen4", size = 2.5) +
annotate(geom = "text", x = 55, y = 20, label = "1081600", color = "hotpink4", size = 2.5) +
theme_test() +
labs(
x = "Age",
y = "Count",
title = "Potential Outliers' Ages",
subtitle = "All toward Upper Limit but Reasonable Values"
)
Based on the Assumption Diagnostic Testing, I decided to definitively remove YearsAtCompany. Lastly, I will create two models, one with Age and one without Age.
reg5 <- glm(Attrition ~ JobRole+Age+DistanceFromHome+TotalWorkingYears + YearsInCurrentRole + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime + MaritalStatus + JobInvolvement + JobSatisfaction + JobLevel, data = train.att, family = "binomial")
summ(reg5, exp = TRUE, confint = TRUE)
## MODEL INFO:
## Observations: 1276
## Dependent Variable: Attrition
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(23) = 503.25, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.62
## Pseudo-R² (McFadden) = 0.53
## AIC = 488.61, BIC = 612.24
##
## Standard errors: MLE
## -----------------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p
## ------------------------------- ------------- ------- ------- -------- ------
## (Intercept) 0.00 0.00 Inf -0.03 0.98
## JobRoleNurse 10302592.11 0.00 Inf 0.03 0.98
## JobRoleOther 6940935.92 0.00 Inf 0.03 0.98
## JobRoleTherapist 4182039.74 0.00 Inf 0.03 0.98
## Age 0.91 0.87 0.95 -4.44 0.00
## DistanceFromHome 1.09 1.06 1.12 5.37 0.00
## TotalWorkingYears 0.95 0.88 1.02 -1.38 0.17
## YearsInCurrentRole 0.74 0.65 0.84 -4.73 0.00
## YearsSinceLastPromotion 1.21 1.06 1.38 2.84 0.00
## EnvironmentSatisfaction.L 0.20 0.12 0.34 -5.95 0.00
## EnvironmentSatisfaction.Q 1.96 1.17 3.27 2.57 0.01
## EnvironmentSatisfaction.C 1.06 0.64 1.76 0.24 0.81
## OverTimeYes 44.70 23.16 86.27 11.33 0.00
## MaritalStatusMarried 1.74 0.84 3.59 1.49 0.14
## MaritalStatusSingle 9.87 4.67 20.85 6.00 0.00
## JobInvolvement.L 0.12 0.05 0.29 -4.67 0.00
## JobInvolvement.Q 1.19 0.58 2.43 0.48 0.63
## JobInvolvement.C 0.92 0.58 1.48 -0.33 0.74
## JobSatisfaction.L 0.36 0.21 0.60 -3.90 0.00
## JobSatisfaction.Q 0.96 0.57 1.63 -0.14 0.89
## JobSatisfaction.C 0.72 0.43 1.21 -1.24 0.22
## JobLevel.L 1.28 0.28 5.93 0.32 0.75
## JobLevel.Q 3.03 1.04 8.78 2.04 0.04
## JobLevel.C 0.32 0.15 0.67 -3.01 0.00
## -----------------------------------------------------------------------------
reg6 <- glm(Attrition ~ JobRole+ DistanceFromHome+TotalWorkingYears + YearsInCurrentRole + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime +MaritalStatus + JobInvolvement + JobSatisfaction + JobLevel, data = train.att, family = "binomial")
summ(reg6, exp = TRUE, confint = TRUE)
## MODEL INFO:
## Observations: 1276
## Dependent Variable: Attrition
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(22) = 479.92, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.60
## Pseudo-R² (McFadden) = 0.51
## AIC = 509.94, BIC = 628.42
##
## Standard errors: MLE
## -----------------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p
## ------------------------------- ------------- ------- ------- -------- ------
## (Intercept) 0.00 0.00 Inf -0.03 0.97
## JobRoleNurse 15450797.13 0.00 Inf 0.03 0.98
## JobRoleOther 10140302.85 0.00 Inf 0.03 0.98
## JobRoleTherapist 6060016.86 0.00 Inf 0.03 0.98
## DistanceFromHome 1.09 1.05 1.12 5.32 0.00
## TotalWorkingYears 0.87 0.81 0.93 -4.22 0.00
## YearsInCurrentRole 0.76 0.68 0.87 -4.25 0.00
## YearsSinceLastPromotion 1.20 1.05 1.36 2.78 0.01
## EnvironmentSatisfaction.L 0.22 0.13 0.37 -5.80 0.00
## EnvironmentSatisfaction.Q 1.91 1.16 3.14 2.55 0.01
## EnvironmentSatisfaction.C 1.00 0.61 1.63 -0.01 0.99
## OverTimeYes 33.64 18.46 61.31 11.48 0.00
## MaritalStatusMarried 1.65 0.81 3.37 1.38 0.17
## MaritalStatusSingle 9.58 4.65 19.73 6.13 0.00
## JobInvolvement.L 0.12 0.05 0.28 -4.79 0.00
## JobInvolvement.Q 1.20 0.60 2.42 0.52 0.60
## JobInvolvement.C 0.90 0.57 1.42 -0.46 0.64
## JobSatisfaction.L 0.39 0.24 0.64 -3.71 0.00
## JobSatisfaction.Q 1.10 0.66 1.83 0.37 0.71
## JobSatisfaction.C 0.71 0.42 1.18 -1.33 0.18
## JobLevel.L 1.56 0.35 6.96 0.58 0.56
## JobLevel.Q 3.32 1.16 9.53 2.23 0.03
## JobLevel.C 0.36 0.17 0.75 -2.73 0.01
## -----------------------------------------------------------------------------
There was little change in compartive model performance with Age removed. Given this, I will move forward with diagnostic testing of Reg6.
residualPlots(reg6)
## Test stat Pr(>|Test stat|)
## JobRole
## DistanceFromHome 0.0547 0.81506
## TotalWorkingYears 3.8661 0.04927 *
## YearsInCurrentRole 1.7947 0.18036
## YearsSinceLastPromotion 2.8233 0.09290 .
## EnvironmentSatisfaction
## OverTime
## MaritalStatus
## JobInvolvement
## JobSatisfaction
## JobLevel
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
marginalModelPlots(reg6)
## Warning in mmps(...): Interactions and/or factors skipped
linearity.check.new <-train.att %>%
select(Age, DistanceFromHome, TotalWorkingYears, YearsInCurrentRole, YearsSinceLastPromotion)
predictors.new <- colnames(linearity.check.new)
linearity.check.new$probabilities <- reg6$fitted.values
linearity.check.new <- linearity.check.new %>%
mutate(logit = log(probabilities/(1 - probabilities))) %>%
select(-probabilities) %>%
gather(key = "predictors", value = "predictor.values", -logit)
ggplot(linearity.check.new, aes(y = logit, x = predictor.values)) +
geom_point() +
stat_smooth(method = "loess") +
theme_test() +
facet_wrap(~predictors, scales = "free_x")
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at -0.075
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 2.075
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 2.4351e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 4
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## -0.075
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.075
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 2.4351e-15
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 4
reg6.cont.ind <- att[c("DistanceFromHome", "TotalWorkingYears", "YearsInCurrentRole", "YearsSinceLastPromotion")]
cor(reg6.cont.ind)
## DistanceFromHome TotalWorkingYears YearsInCurrentRole
## DistanceFromHome 1.00000000 0.02189604 0.02618534
## TotalWorkingYears 0.02189604 1.00000000 0.47253042
## YearsInCurrentRole 0.02618534 0.47253042 1.00000000
## YearsSinceLastPromotion 0.01630781 0.39211092 0.54189737
## YearsSinceLastPromotion
## DistanceFromHome 0.01630781
## TotalWorkingYears 0.39211092
## YearsInCurrentRole 0.54189737
## YearsSinceLastPromotion 1.00000000
vif(reg6)
## GVIF Df GVIF^(1/(2*Df))
## JobRole 1.502246 3 1.070180
## DistanceFromHome 1.198696 1 1.094850
## TotalWorkingYears 2.467401 1 1.570796
## YearsInCurrentRole 1.743091 1 1.320262
## YearsSinceLastPromotion 1.880525 1 1.371322
## EnvironmentSatisfaction 1.382737 3 1.055496
## OverTime 1.602801 1 1.266018
## MaritalStatus 1.296886 2 1.067150
## JobInvolvement 1.215446 3 1.033053
## JobSatisfaction 1.217971 3 1.033410
## JobLevel 3.427611 3 1.227906
Similar to the issue of Age in the initial model iterations, I now have evidence that TotalWorkingYears does not fit the logistic regression model well. I will generate an additional model with this variable removed and compare its performance against reg6.
reg7 <- glm(Attrition ~ JobRole+ DistanceFromHome + YearsInCurrentRole + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime +MaritalStatus + JobInvolvement + JobSatisfaction + JobLevel, data = train.att, family = "binomial")
summ(reg7, exp = TRUE, confint = TRUE)
## MODEL INFO:
## Observations: 1276
## Dependent Variable: Attrition
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(21) = 458.45, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.58
## Pseudo-R² (McFadden) = 0.49
## AIC = 529.41, BIC = 642.74
##
## Standard errors: MLE
## -----------------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p
## ------------------------------- ------------- ------- ------- -------- ------
## (Intercept) 0.00 0.00 Inf -0.04 0.97
## JobRoleNurse 15200187.77 0.00 Inf 0.03 0.98
## JobRoleOther 10069415.82 0.00 Inf 0.03 0.98
## JobRoleTherapist 5417490.74 0.00 Inf 0.03 0.98
## DistanceFromHome 1.08 1.05 1.11 5.30 0.00
## YearsInCurrentRole 0.73 0.65 0.82 -5.36 0.00
## YearsSinceLastPromotion 1.16 1.03 1.31 2.46 0.01
## EnvironmentSatisfaction.L 0.25 0.15 0.41 -5.44 0.00
## EnvironmentSatisfaction.Q 1.65 1.02 2.67 2.06 0.04
## EnvironmentSatisfaction.C 0.93 0.58 1.49 -0.31 0.75
## OverTimeYes 28.47 16.03 50.57 11.43 0.00
## MaritalStatusMarried 1.81 0.89 3.67 1.64 0.10
## MaritalStatusSingle 10.42 5.07 21.39 6.38 0.00
## JobInvolvement.L 0.11 0.05 0.26 -5.05 0.00
## JobInvolvement.Q 1.18 0.60 2.32 0.49 0.63
## JobInvolvement.C 0.88 0.56 1.38 -0.57 0.57
## JobSatisfaction.L 0.44 0.27 0.70 -3.40 0.00
## JobSatisfaction.Q 0.96 0.59 1.57 -0.16 0.87
## JobSatisfaction.C 0.79 0.49 1.30 -0.92 0.36
## JobLevel.L 0.28 0.08 0.99 -1.98 0.05
## JobLevel.Q 2.50 0.91 6.85 1.78 0.07
## JobLevel.C 0.32 0.16 0.66 -3.09 0.00
## -----------------------------------------------------------------------------
There is little change in model performance with and without TotalWorkingYears. Given this, I will move ahead with diagnostic testing of assumptions for the newest model, reg7.
residualPlots(reg7)
## Test stat Pr(>|Test stat|)
## JobRole
## DistanceFromHome 0.3125 0.5762
## YearsInCurrentRole 1.1908 0.2752
## YearsSinceLastPromotion 2.0170 0.1555
## EnvironmentSatisfaction
## OverTime
## MaritalStatus
## JobInvolvement
## JobSatisfaction
## JobLevel
marginalModelPlots(reg7)
## Warning in mmps(...): Interactions and/or factors skipped
linearity.check.new <-train.att %>%
select(DistanceFromHome, YearsInCurrentRole, YearsSinceLastPromotion)
predictors.new <- colnames(linearity.check.new)
linearity.check.new$probabilities <- reg7$fitted.values
linearity.check.new <- linearity.check.new %>%
mutate(logit = log(probabilities/(1 - probabilities))) %>%
select(-probabilities) %>%
gather(key = "predictors", value = "predictor.values", -logit)
ggplot(linearity.check.new, aes(y = logit, x = predictor.values)) +
geom_point() +
stat_smooth(method = "loess") +
theme_test() +
facet_wrap(~predictors, scales = "free_x")
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at -0.075
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 2.075
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 2.4351e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 4
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## -0.075
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.075
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 2.4351e-15
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 4
reg7.cont.ind <- att[c("DistanceFromHome", "YearsInCurrentRole", "YearsSinceLastPromotion")]
cor(reg7.cont.ind)
## DistanceFromHome YearsInCurrentRole
## DistanceFromHome 1.00000000 0.02618534
## YearsInCurrentRole 0.02618534 1.00000000
## YearsSinceLastPromotion 0.01630781 0.54189737
## YearsSinceLastPromotion
## DistanceFromHome 0.01630781
## YearsInCurrentRole 0.54189737
## YearsSinceLastPromotion 1.00000000
vif(reg7)
## GVIF Df GVIF^(1/(2*Df))
## JobRole 1.452671 3 1.064211
## DistanceFromHome 1.180258 1 1.086397
## YearsInCurrentRole 1.636493 1 1.279255
## YearsSinceLastPromotion 1.810452 1 1.345530
## EnvironmentSatisfaction 1.317550 3 1.047035
## OverTime 1.540195 1 1.241046
## MaritalStatus 1.309297 2 1.069694
## JobInvolvement 1.229895 3 1.035090
## JobSatisfaction 1.162593 3 1.025427
## JobLevel 2.230672 3 1.143069
outlierTest(reg7)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 834 3.071789 0.0021278 NA
influenceIndexPlot(reg7)
influencePlot(reg7)
## StudRes Hat CookD
## 317 -1.083660122 0.218460028 1.047011e-02
## 1084 -0.001454763 0.585381756 9.600931e-08
## 1221 2.587546650 0.056571533 4.736500e-02
## 572 2.737863674 0.006265692 1.060972e-02
## 834 3.071789049 0.004706782 1.934519e-02
## 973 1.810158045 0.199348469 3.794619e-02
reg7.no.outlier <- glm(Attrition ~ JobRole+ DistanceFromHome + YearsInCurrentRole + YearsSinceLastPromotion + EnvironmentSatisfaction + OverTime +MaritalStatus + JobInvolvement + JobSatisfaction + JobLevel, data = train.att, family = "binomial")
summ(reg7.no.outlier, exp = TRUE, confint = TRUE)
## MODEL INFO:
## Observations: 1276
## Dependent Variable: Attrition
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(21) = 458.45, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.58
## Pseudo-R² (McFadden) = 0.49
## AIC = 529.41, BIC = 642.74
##
## Standard errors: MLE
## -----------------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p
## ------------------------------- ------------- ------- ------- -------- ------
## (Intercept) 0.00 0.00 Inf -0.04 0.97
## JobRoleNurse 15200187.77 0.00 Inf 0.03 0.98
## JobRoleOther 10069415.82 0.00 Inf 0.03 0.98
## JobRoleTherapist 5417490.74 0.00 Inf 0.03 0.98
## DistanceFromHome 1.08 1.05 1.11 5.30 0.00
## YearsInCurrentRole 0.73 0.65 0.82 -5.36 0.00
## YearsSinceLastPromotion 1.16 1.03 1.31 2.46 0.01
## EnvironmentSatisfaction.L 0.25 0.15 0.41 -5.44 0.00
## EnvironmentSatisfaction.Q 1.65 1.02 2.67 2.06 0.04
## EnvironmentSatisfaction.C 0.93 0.58 1.49 -0.31 0.75
## OverTimeYes 28.47 16.03 50.57 11.43 0.00
## MaritalStatusMarried 1.81 0.89 3.67 1.64 0.10
## MaritalStatusSingle 10.42 5.07 21.39 6.38 0.00
## JobInvolvement.L 0.11 0.05 0.26 -5.05 0.00
## JobInvolvement.Q 1.18 0.60 2.32 0.49 0.63
## JobInvolvement.C 0.88 0.56 1.38 -0.57 0.57
## JobSatisfaction.L 0.44 0.27 0.70 -3.40 0.00
## JobSatisfaction.Q 0.96 0.59 1.57 -0.16 0.87
## JobSatisfaction.C 0.79 0.49 1.30 -0.92 0.36
## JobLevel.L 0.28 0.08 0.99 -1.98 0.05
## JobLevel.Q 2.50 0.91 6.85 1.78 0.07
## JobLevel.C 0.32 0.16 0.66 -3.09 0.00
## -----------------------------------------------------------------------------
AIC(reg7, reg7.no.outlier)
## df AIC
## reg7 22 529.4085
## reg7.no.outlier 22 529.4085
influencePlot(reg7.no.outlier)
## StudRes Hat CookD
## 317 -1.083660122 0.218460028 1.047011e-02
## 1084 -0.001454763 0.585381756 9.600931e-08
## 1221 2.587546650 0.056571533 4.736500e-02
## 572 2.737863674 0.006265692 1.060972e-02
## 834 3.071789049 0.004706782 1.934519e-02
## 973 1.810158045 0.199348469 3.794619e-02
if (!require(caret)) install.packages("caret")
## Loading required package: caret
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(caret)
test.probabilities <- predict(reg7, test.att, type = "response")
contrasts(test.att$Attrition)
## Yes
## No 0
## Yes 1
predicted.Attrition<- ifelse(test.probabilities > 0.5, "Yes", "No")
mean(predicted.Attrition == test.att$Attrition)
## [1] 0.8934169
table(predicted.Attrition, test.att$Attrition)
##
## predicted.Attrition No Yes
## No 267 24
## Yes 10 18